Textual Search in Graphics Stream of PDF
Abstract
Digitized books and manuscripts in digital libraries are often stored as images or graphics. They are not searchable at the content level, either because no OCR is available or because the scanned images are of poor quality. Portable Document Format (PDF) has emerged as the most popular document representation schema for access across platforms. When no textual (Unicode, ASCII) representation is available, scanned images are stored in the graphics stream of the PDF. In this paper, we propose a novel solution for searching textual data in the graphics stream of PDF files at the content level. The solution is demonstrated by enhancing an open-source PDF viewer (Xpdf). Indian language support is also provided: users can type a word in Roman script (ITRANS), view it in a font, and search the textual and graphics streams of PDF documents simultaneously.
Related resources
Towards High-Quality Text Stream Extraction from PDF. Technical Background to the ACL 2012 Contributed Task
Extracting textual content and document structure from PDF presents a surprisingly (depressingly, to some, in fact) difficult challenge, owing to the purely display-oriented design of the PDF document standard. While a variety of lower-level PDF extraction toolkits exist, none fully support the recovery of original text (in reading order) and relevant structural elements, even for so-called bor...
A Comparison of Tabular PDF Inversion Methods
The most common form of tabular inversion used in computer graphics is to compute the cumulative distribution table of a pdf and then search within it to transform points, using an O(logn) binary search. Besides the standard inversion method, however, several other discrete inversion algorithms exist that can perform the same transformation in O(1) time per point. In this paper, we examine the ...
Identifying and Ranking the Important Textual and Paratextual Elements in Fiction Retrieval
Purpose: The purpose of this study is to identify the textual and paratextual elements in retrieving fiction from the readers’ perspective in order to provide the most appropriate access points for the readers and to improve access to fictions based on the readers’ needs. Method: The current research is an applied study in terms of purpose, applying a mixed method that was conducted using the ...
XHTML and SVG: Publishing with concept
Electronic publishing with tools from the Extensible Markup Language (XML) family of technologies has been increasingly used since the first XML and Extensible Stylesheet Language Transformations (XSLT) specifications were published in 1998/1999 and supporting processing applications emerged. This paper describes ideas and solutions of how to migrate the existing electronic publishing proc...
Combining Visual and Textual Features for Information Extraction from Online Flyers
Information in visually rich formats such as PDF and HTML is often conveyed by a combination of textual and visual features. In particular, genres such as marketing flyers and info-graphics often augment textual information by its color, size, positioning, etc. As a result, traditional text-based approaches to information extraction (IE) could underperform. In this study, we present a supervise...